Feature Selection: An Exploration of Algorithm Performance

Jason Case, Abhishek Dugar, Daniel Nkuah, Khoa Tran

2024-07-30

Introduction

What is feature selection? Why do we care?

  • Feature selection is a crucial step in data preprocessing, especially in machine learning and statistical modeling. It involves selecting a subset of relevant features (variables, predictors) for building a model.
  • It is important because it reduces the risk of overfitting and enhances the performance of machine learning models and pattern recognition systems.

Introduction: Early Techniques

Forward, backward, and stepwise variable selection in linear models

  • The forward method starts with no variables in the model and adds them one at a time until no further improvement is shown.
  • The backward method starts with all variables and iteratively removes the least significant one.
  • Stepwise selection combines both approaches, iteratively adding and removing variables to optimize the model's performance.
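As a concrete illustration, forward selection can be sketched with scikit-learn's SequentialFeatureSelector on synthetic data (the dataset and the target subset size here are invented for the example, not taken from the analysis below):

```python
# Sketch of forward selection: start with no features, add one at a time,
# judging each addition by cross-validated score. Toy regression data assumed.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# direction="forward": grow the subset from empty until
# n_features_to_select features have been added.
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
sfs.fit(X, y)
print(sorted(np.flatnonzero(sfs.get_support())))  # indices of retained features
```

Passing direction="backward" to the same class gives the backward method: start with all features and drop one per step.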

Introduction: Early Techniques

Univariate screening procedures (USP)

  • USP represents an important milestone for feature selection.
  • These methods involve evaluating each predictor variable individually to determine its relationship with the target variable. The process selects variables that meet a specific statistical threshold.
  • While these methods are simple and quick to use, they often miss the complex connections between variables.
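A minimal sketch of univariate screening, assuming an F-test as the per-feature statistic and invented data (the threshold here is a fixed count k rather than a p-value cutoff):

```python
# Univariate screening: score each feature against the target independently,
# then keep only the top-scoring ones. Illustrative data, not the paper's.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)

# Score every feature on its own; keep the 4 with the largest F-statistics.
screen = SelectKBest(score_func=f_classif, k=4)
X_reduced = screen.fit_transform(X, y)
print(X_reduced.shape)  # (300, 4)
```

Note that the scoring is one feature at a time, which is exactly why such methods miss interactions between variables.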

Introduction: Modern Techniques

Advanced feature selection methods

  • Similarity-based approaches select features based on their similarity or dissimilarity.
  • Information-theoretical-based approaches use concepts like entropy and mutual information to evaluate feature importance.
  • Sparse-learning-based approaches focus on identifying a small, essential set of features by limiting the number of features selected.
  • Statistical-based approaches utilize statistical tests and models to determine the significance of features for predicting the target variable.

Introduction: Modern Techniques

Classification of methods

  • The filter model selects features based on the general properties of the training data, independent of any learning algorithm, making it computationally efficient for large datasets.
  • The wrapper model, on the other hand, uses a specific learning algorithm to evaluate and determine which features to keep, often leading to better performance but at a higher computational cost.
  • Embedded methods perform feature selection during the model training process, integrating selection directly with learning.
  • Hybrid methods combine the best aspects of filter and wrapper methods to achieve optimal performance with manageable computational complexity.

Methods: Correlation-Based Feature Selection (CFS)

  • Filter method
  • Uses the correlation coefficient to measure each variable's relationship with the target independently.
  • Features that are highly correlated with one another are considered redundant.

CFS’s feature subset evaluation function is:

\(M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}\)

where \(k\) is the number of features in the subset, \(\overline{r_{cf}}\) is the mean feature-class correlation, and \(\overline{r_{ff}}\) is the mean feature-feature correlation.
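The merit function \(M_S = k\,\overline{r_{cf}} / \sqrt{k + k(k-1)\,\overline{r_{ff}}}\) is simple to evaluate directly; a sketch with invented correlation values:

```python
# Sketch of the CFS merit function for a candidate subset of k features,
# given its mean feature-class correlation (mean_rcf) and mean
# feature-feature correlation (mean_rff). Toy values assumed.
import math

def cfs_merit(k, mean_rcf, mean_rff):
    """Merit M_S = k*rcf / sqrt(k + k(k-1)*rff) of a k-feature subset."""
    return (k * mean_rcf) / math.sqrt(k + k * (k - 1) * mean_rff)

# A subset whose features correlate with the class but not with each other
# scores higher than an equally predictive but redundant subset.
print(cfs_merit(5, 0.6, 0.1))
print(cfs_merit(5, 0.6, 0.8))
```

The denominator grows with \(\overline{r_{ff}}\), so inter-correlated (redundant) features are penalized, matching the bullet above.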

Methods: Recursive Feature Elimination (RFE)

  • Wrapper method.
  • Removes variables iteratively.

Steps:

  1. Train the classifier.
  2. Compute the ranking criterion for all features.
  3. Remove the feature(s) with the smallest ranking values.
  4. Repeat until the desired number of features is selected.

The optimal subset of features is the one that provides the highest accuracy.
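The steps above can be sketched with scikit-learn's RFE, here with a logistic classifier on toy data (the estimator and subset size are assumptions for the example):

```python
# RFE sketch: train, rank features by coefficient magnitude, eliminate the
# weakest, and repeat until the target subset size is reached.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=15, n_informative=5, random_state=0)

# step=1: eliminate one feature per iteration until 5 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(X, y)
print(rfe.support_.sum())   # 5 features kept
print(rfe.ranking_)         # 1 = selected; larger = eliminated earlier
```

In practice the "desired number of features" is itself tuned, e.g. by cross-validating over candidate subset sizes as in the modeling setup below.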

Methods: Least Absolute Shrinkage and Selection Operator (LASSO)

  • Embedded method.
  • LASSO estimates the coefficients of the features.
  • Uses L1 regularization to shrink coefficients toward zero.

The LASSO estimate is defined by the solution to the \(\ell_1\) optimization problem

minimize \(\frac{1}{n}\| Y - X\beta \|_2^2\) subject to \(\sum_{j=1}^{k} |\beta_j| \le t\)

where \(t\) is the upper bound for the sum of the absolute values of the coefficients.
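In software the constrained problem is usually solved in its equivalent penalized (Lagrangian) form, where the penalty weight alpha plays the role of the bound \(t\); a sketch on invented regression data:

```python
# LASSO sketch: L1-penalized least squares. Features with little predictive
# value have their coefficients driven exactly to zero. Toy data assumed.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=4,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # features with nonzero weight
print(len(selected))  # typically far fewer than 20: the rest shrink to zero
```

The selected-feature count shrinks as alpha grows (equivalently, as \(t\) shrinks), which is why the penalty term is the quantity tuned by cross-validation in the modeling section below.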

Methods: CFS & RFE

  • Hybrid method.
  • Combines filter and wrapper methods.

Step 1: Filter using the correlation coefficient (i.e., trim the “low-hanging fruit”).

Step 2: Remove remaining variables iteratively using RFE.
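A minimal sketch of the two-step hybrid, assuming toy data and a fixed 20% filter cut (the cross-validated tuning used in the actual analysis is omitted):

```python
# Hybrid sketch: (1) drop the features least correlated with the target,
# (2) run RFE on the survivors. Illustrative data, not the paper's.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Step 1: drop the 20% of features with the weakest |corr(feature, target)|.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
keep = np.argsort(corr)[int(0.2 * X.shape[1]):]  # indices surviving the filter
X_filtered = X[:, keep]

# Step 2: RFE on the remaining features.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X_filtered, y)
print(keep[rfe.support_])  # original indices of the final subset
```

The cheap filter shrinks the pool before the expensive wrapper runs, which is the computational motivation for hybrid methods noted in the introduction.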

Analysis: Spambase Dataset

  • 4601 instances of emails
  • 57 features for classification tasks
  • Binary classification: email is spam (1) or not (0)
  • 80/20 train/test split
  • Increased the number of features by adding all \(\binom{57}{2} = 1{,}596\) two-way interactions
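The interaction expansion can be reproduced with scikit-learn's PolynomialFeatures (dummy data of the same width assumed here):

```python
# Sketch of the two-way interaction expansion: 57 original features plus
# C(57, 2) = 1596 pairwise products = 1653 columns. Random dummy data.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.default_rng(0).normal(size=(10, 57))

# interaction_only=True: products of distinct pairs only, no squared terms;
# include_bias=False: no constant column.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_expanded = poly.fit_transform(X)
print(X_expanded.shape[1])  # 57 originals + 1596 interactions = 1653
```

This matches the 1653-feature Baseline model in the Spambase results table.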

Analysis: Spambase Dataset

Figure 1. Frequency of spam targets.

Analysis: COVID-19 NLP Text Classification Dataset

  • ~45k tweets related to COVID-19, labeled for sentiment analysis.
  • Five sentiment label classes, ranging from extremely positive to extremely negative.
  • Recoded to a binary classification task: positive (1) or negative (0).
  • 33,444 (91%) training and 3,179 (9%) testing records after recoding.

Analysis: COVID-19 NLP Text Classification Dataset

Figure 2. Visualization of common words.

Analysis: COVID-19 NLP Text Classification Dataset

Figure 3. Frequency of sentiment targets.

Analysis: COVID-19 NLP Text Classification Dataset

“Bag of words”

  • Text data converted into a matrix of word frequencies
  • Each row represents a document
  • Each column represents a unique word from the entire corpus
  • Large (several thousand variables), sparse (> 99% of values = 0) feature set

Analysis: COVID-19 NLP Text Classification Dataset

Figure 4. Distribution of words per Tweet.

Statistical Modeling

Three metrics:

  • Accuracy on the test set
  • Difference between accuracy on the training and test sets (overtraining)
  • Number of variables selected (model complexity)

5 models:

  • Baseline: full logistic model with no feature selection
  • CFS: select the best of 20 correlation thresholds using cross-validation
  • RFE: select the best of 20 subset sizes using cross-validation
  • LASSO: select the best penalty term using cross-validation
  • CFS + RFE: remove the 20% of variables with the lowest correlation, then select the best of 20 subset sizes using cross-validation
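The shared tuning pattern, picking a method's knob by cross-validation, can be sketched for the LASSO case with LogisticRegressionCV (the data and the 20-candidate grid here are illustrative assumptions):

```python
# Sketch of cross-validated penalty selection for an L1 logistic model:
# try 20 candidate penalties, score each by 5-fold CV, keep the best.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=400, n_features=30, n_informative=6, random_state=0)

# liblinear supports the L1 penalty, so weak features drop to exactly zero.
model = LogisticRegressionCV(Cs=20, cv=5, penalty="l1", solver="liblinear").fit(X, y)
print(model.C_)                  # penalty chosen by cross-validation
print((model.coef_ != 0).sum())  # features surviving at that penalty
```

The CFS and RFE models follow the same outline with their own grids: 20 correlation thresholds and 20 subset sizes, respectively.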

Statistical Modeling

Figure 5. Example of cross-validation accuracy with CFS model for the Sentiment task.

Statistical Modeling

Figure 6. Example of coefficient behavior with LASSO model for the spam task.

Results: Spambase Dataset

Method      Number of Features   Test Accuracy   Accuracy Decrease from Train
Baseline    1653                 0.915            0.070
CFS          126                 0.936           -0.004
RFE         1144                 0.911            0.067
LASSO         50                 0.903           -0.006
CFS + RFE    683                 0.920            0.045

Results: COVID-19 NLP Text Classification Dataset

Method      Number of Features   Test Accuracy   Accuracy Decrease from Train
Baseline    4820                 0.846            0.097
CFS         1449                 0.868            0.042
RFE         2980                 0.863            0.059
LASSO       4472                 0.880            0.045
CFS + RFE   2672                 0.778            0.088

Results: Records per Feature

Summary

Conclusion